This report explores which chemical properties influence the quality of red wines.

Univariate Plots Section

The report explores a dataset containing a quality score and 11 chemical features for 1,599 red wine observations.

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0

Counting the NA values in each column of the data frame shows that none are missing.
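The output above is consistent with a handful of base R calls. Here is a minimal sketch, assuming a data frame named `wine` (a hypothetical name) that holds only the first ten rows of four columns, copied from the `str()` output above. The report's own loading code is not shown, so this is an illustration and not the original chunk.

```r
# First ten rows of four columns, copied from the str() output above.
# `wine` is a hypothetical name; the full data has 1599 rows and 12 columns.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine)      # number of rows and columns
str(wine)      # column types and leading values
summary(wine)  # quartiles, mean, and range for each column

# Count missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(wine))
print(na_counts)
```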

##                             [,1]
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632

Correlation of each variable with quality. Four attributes have a weak to moderate correlation (either negative or positive) with quality: volatile.acidity, citric.acid, sulphates, and alcohol.
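A one-column table like the one above can be produced by passing the predictor columns and the quality vector to `cor()`. The sketch below assumes a hypothetical data frame `wine` built from the first ten rows shown in the `str()` output, so its numbers will differ from the full-data values above.

```r
# First ten rows of five columns, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation of every predictor column with quality.
predictors <- setdiff(names(wine), "quality")
cor_with_quality <- cor(wine[, predictors], wine$quality)
print(cor_with_quality)

# Rank predictors by the absolute size of their correlation.
ranked <- sort(abs(cor_with_quality[, 1]), decreasing = TRUE)
print(ranked)
```

Ranking by absolute value matters here because a strong negative correlation, such as the one for volatile acidity, is as informative as a positive one.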

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Correlations between all variables.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The quality scores follow a roughly normal distribution. Although the scale runs from 0 to 10, no wine scored below 3 or above 8, and most scored 5 or 6.

Next, explore the four variables with the highest correlations with quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## [1] 0.6703331

Volatile acidity is positively skewed.
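The skewness value above comes from a function the report does not show. As a rough illustration, here is the plain moment-based skewness coefficient written in base R. The name `skewness_moment` is hypothetical, and packaged skewness functions may apply small-sample corrections that change the result slightly.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# Positive values indicate a longer right tail.
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten volatile acidity values from the str() output.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
print(skewness_moment(volatile_acidity))
```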

## [1] 1599   12
## [1] 1580   13
## [1] 19
## [1] -0.3634752

Removing the 19 outliers did not improve the correlation with quality (-0.36, compared with -0.39 before).
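The report does not say which rule defined an outlier. The sketch below assumes the common 1.5 × IQR fence and stores the result in a flag column, which adds one column in the same way the filtered dimensions above gain one. The names `flag_outliers` and `va.outlier` are hypothetical, and the data are only the first ten rows from the `str()` output.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR].
# ASSUMPTION: the 1.5 * IQR rule; the report's actual rule is not shown.
flag_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Add a flag column, then keep only the rows that are not flagged.
wine$va.outlier <- flag_outliers(wine$volatile.acidity)
kept <- wine[!wine$va.outlier, ]

dim(kept)                      # rows remaining after filtering
nrow(wine) - nrow(kept)        # number of rows removed
cor(kept$volatile.acidity, kept$quality)
```

Comparing the correlation before and after filtering, as the report does, shows whether a few extreme values were driving the relationship.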

The fourth root (the square root of the square root) appears to give the most nearly normal distribution.

## [1] -0.3934108

But it doesn’t create a much stronger correlation (about -0.39 both before and after).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## [1] 0.3177403

Citric acid appears to be positively skewed, but the jump around 0.5 reduces the skewness measure.

## [1] 0.2324389

Removing outliers did not meaningfully improve the correlation (about 0.23 either way), which makes sense since the chart doesn’t appear to show any outliers.

Of all the transforms, the square root creates the most nearly normal distribution, but it still leaves a large number of wines with almost no citric acid.

## [1] 0.2066822

And the square root actually lowers the correlation, from 0.23 to 0.21.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## [1] 2.424118

Sulphates are highly positively skewed.

## [1] 0.3940654

Removing sulphate outliers significantly improves the correlation with quality, from 0.25 to 0.39.

The reciprocal transform gives the most nearly normal distribution.

## [1] -0.3403317

This increased the strength of the correlation over the untransformed value (0.34 versus 0.25 in absolute terms) and turned it negative. We’ll explore both options in the bivariate section.
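The sign change is expected: for positive values, 1/x is strictly decreasing, so it reverses the ordering of the sulphate values. A rank-based (Spearman) correlation makes this easy to see, because it flips sign exactly under such a transform. The sketch below uses the first ten sulphate and quality values from the `str()` output and is only an illustration of the effect.

```r
# First ten values from the str() output.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Spearman correlation depends only on ranks, so a strictly
# decreasing transform such as 1/x negates it exactly.
rho_raw        <- cor(sulphates, quality, method = "spearman")
rho_reciprocal <- cor(1 / sulphates, quality, method = "spearman")

print(c(raw = rho_raw, reciprocal = rho_reciprocal))
```

The Pearson correlation reported above also changes in magnitude because the reciprocal rescales the values as well as reversing their order.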

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## [1] 0.8592144

Alcohol has a positive skew.

## [1] 0.4710238

Removing alcohol outliers did not improve correlation with quality.

The square-root transform gives the most nearly normal distribution.

## [1] 0.4768205

No real change in correlation, but we’ll plot both in the bivariate section.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality).

  • Most red wines are of quality 5 or 6.
  • The median quality is 6.
  • The mean alcohol content of the wines is 10.42.
  • About 75% of red wines have sulphates of 0.73 or less.

What is/are the main feature(s) of interest in your dataset?

The variables volatile acidity, citric acid, sulphates, and alcohol have the highest correlations with quality.

Of the features you investigated, were there any unusual distributions?

I checked for NA values and found none.

For the four features with the highest correlations, I removed outliers and recomputed the correlations with quality. This significantly increased the correlation only for sulphates.

I also applied log, square root, fourth root (square root of the square root), cube root, and reciprocal transforms to all four features to see whether a more normal distribution could be created and whether it would give a higher correlation with quality. Apart from the reciprocal of sulphates, no meaningfully higher correlations were achieved.
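The transform comparison described above can be sketched as a named list of functions applied in turn. This is an illustration on the first ten sulphate and quality values from the `str()` output, not the report's own code, and the names `transforms` and `cor_by_transform` are hypothetical. Note that log and reciprocal are undefined at zero, so a feature such as citric acid, which contains zeros, would need special handling.

```r
# Candidate transforms named in the text.
transforms <- list(
  identity    = function(x) x,
  log         = function(x) log(x),
  sqrt        = function(x) sqrt(x),
  fourth_root = function(x) x^(1 / 4),
  cube_root   = function(x) x^(1 / 3),
  reciprocal  = function(x) 1 / x
)

# First ten values from the str() output (all sulphate values are positive).
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson correlation with quality under each transform.
cor_by_transform <- sapply(transforms, function(f) cor(f(sulphates), quality))
print(round(cor_by_transform, 3))
```

The same list can be reused for each feature, which keeps the comparison consistent across volatile acidity, citric acid, sulphates, and alcohol.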

Bivariate Plots Section

The negative correlation is obvious in the trendline between volatile acidity and quality.

The weak positive correlation can be seen in the trendline between citric acid and quality.

The reciprocal sulphates show a much stronger negative trend with quality.

Both alcohol and its square-root transform show the strongest trendlines with quality that we’ve seen.

Now we’ll see how the four attributes with the highest correlations with quality correlate with each other.

## [1] -0.5524957

There is a fairly strong negative correlation between volatile acidity and citric acid, though the two may be correlated merely because both are acids in the wine.

## [1] -0.2609867

Weak correlation between volatile acidity and sulphates.

## [1] -0.202288

Weak correlation between volatile acidity and alcohol.

## [1] 0.31277

Medium correlation between citric acid and sulphates.

## [1] 0.1099032

Very little correlation between citric acid and alcohol.

## [1] 0.09359475

Very little correlation between sulphates and alcohol.
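The six pairwise values above can also be read off a single correlation matrix. Here is a minimal sketch, assuming a hypothetical data frame `top4` that holds the first ten rows of the four features from the `str()` output. The numbers it prints will differ from the full-data values.

```r
# First ten rows of the four features, copied from the str() output.
top4 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

pairwise <- cor(top4)

# Keep each unordered pair once by taking the upper triangle.
idx <- which(upper.tri(pairwise), arr.ind = TRUE)
pair_table <- data.frame(
  feature_a = rownames(pairwise)[idx[, "row"]],
  feature_b = colnames(pairwise)[idx[, "col"]],
  r         = pairwise[upper.tri(pairwise)]
)
print(pair_table)
```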

Bivariate Analysis

Alcohol has the highest positive correlation with quality, followed by sulphates. Volatile acidity, citric acid, and sulphates have weak to moderate correlations with each other.

The most interesting finding is that, although citric acid and sulphates are positively correlated with each other, both have very low correlations with alcohol. This suggests that alcohol combined with either citric acid or sulphates could be a useful set of attributes for prediction.

Multivariate Plots Section

With the qualities bucketed by alcohol, it’s obvious that alcohol has the greater impact, since no quality 8 wine appears below the 10.5-12 bucket. The line charts still show the negative correlation with volatile acidity, as the lines for higher qualities sit gradually lower on the graph.
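Bucketing a continuous variable like alcohol is typically done with `cut()`. In the sketch below, only the 10.5-12 and 12-16 buckets come from the text. The lower break points (8 and 9.5) are assumptions made for illustration, and the data are the first ten rows from the `str()` output.

```r
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# ASSUMPTION: 8 and 9.5 are hypothetical lower breaks; 10.5, 12, and 16
# match the bucket edges mentioned in the text.
breaks <- c(8, 9.5, 10.5, 12, 16)
wine$alcohol.bucket <- cut(wine$alcohol, breaks = breaks)

# Number of wines per bucket (intervals are open on the left, closed on the right).
print(table(wine$alcohol.bucket))

# Median volatile acidity for each bucket and quality level present in the data.
bucket_medians <- aggregate(volatile.acidity ~ alcohol.bucket + quality,
                            data = wine, FUN = median)
print(bucket_medians)
```

Summaries like `bucket_medians` are the kind of values a per-bucket box plot or line chart would display.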

The positive correlations of both features are evident in both charts, as are the relatively low correlation values. There is a large spread for citric acid in the 12-16 bucket of alcohol. Maybe citric acid matters more in determining quality at lower levels of alcohol content.

Sulphates show an obvious positive correlation with quality in both charts and appear to be less dependent on the alcohol quantities than citric acid.

Multivariate Analysis

The box plots revealed a lot of nuances in the data.

Volatile acidity’s negative correlation appears to come mostly from extremely high amounts. This is shown in the 10.5-12 alcohol bucket, where quality level 3 shows a large spike in the amount of volatile acidity.

Citric acid appears to have more of an impact on quality while in the lower alcohol buckets. Once the alcohol hits the highest bucket, the citric acid ranges for quality levels 6, 7, and 8 have large spreads and overlap each other.

Sulphates appear to have fairly tight ranges for all quality levels in each of the alcohol buckets. In addition, the sulphate ranges for each quality level appear to be less affected by the alcohol buckets.

Final Plots and Summary

Plot One

Description One

Alcohol had the highest positive correlation with wine quality. This makes sense, as one of the primary reasons to have an alcoholic beverage in the first place is the alcohol. At a quality level of around 7, the vast majority of wines contain more than 10% alcohol.

Plot Two

Description Two

Sulphates had the second highest positive correlation with quality. Sulphates are additives to wines which act as antimicrobial and antioxidant agents. These preserve the wines, so perhaps higher sulphate levels make it less likely that a tasted wine has gone bad.

Plot Three

Description Three

This shows little correlation between sulphates and alcohol, the two attributes with the highest positive correlations with wine quality in the dataset. Since we’re ultimately trying to find the attributes which influence the quality of wine and possibly to predict the quality based on these attributes, it’s important that the features are not redundant. Redundant (highly correlated) attributes add little new information and can make a model’s estimates unstable, which may lead to overfitting.

Reflections

As wine quality was pretty much a categorical value containing mostly values of 5 or 6, these highly influenced the appearance of the graphs correlating with quality. I was hoping that some of the transforms would give a higher correlation with quality than just the normal attribute, but I didn’t see any real evidence of this with the transforms I created.

Some limitations are due to the volume of data. 1,599 records is not a large dataset; perhaps I should have chosen the white wines instead. To investigate the data further, I would like to see a larger set. In addition, while the quality measure was the median of three wine experts’ scores, I would also like to see the mean, which would give a more continuous quality measurement.
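To illustrate that last point, here is a small sketch with made-up ratings. The scores below are hypothetical and are not taken from the dataset. They only show how the median of three integer scores stays on the integer scale while the mean can fall between levels.

```r
# HYPOTHETICAL ratings: each row is one wine scored by three experts.
ratings <- rbind(
  wine_a = c(5, 6, 6),
  wine_b = c(5, 5, 6),
  wine_c = c(6, 7, 8)
)

median_score <- apply(ratings, 1, median)  # always one of the integer scores
mean_score   <- rowMeans(ratings)          # can fall between integer levels

print(cbind(median = median_score, mean = round(mean_score, 2)))
```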